Conversation
Pull request overview
Adds a nightly (and manually dispatchable) GPU regression workflow that trains a model, runs a set of log-based sanity checks, converts a checkpoint, and runs inference—plus a few supporting config/doc tweaks.
Changes:
- Add a new GitHub Actions workflow to run GPU regression training + inference with log validators.
- Add helper scripts to validate training signals (loss drop, grad norm, grad sync, state dict keys).
- Reduce distributed log spam by gating `from_pretrained` prints to the main process; update docs/configs for CI usage.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/opentau/policies/pi05/modeling_pi05.py | Gate verbose loading/remapping prints to main process in distributed runs. |
| docs/source/tutorials/inference.rst | Update inference command to point at OpenTau's inference script. |
| configs/examples/accelerate_deepspeed_config.yaml | Adjust example accelerate config process count (used by regression workflow). |
| configs/dev/ci_config.json | Update CI training config to use pi05 + smaller action chunking and CI-specific settings. |
| .github/workflows/regression_test.yml | Add nightly GPU regression workflow (start runner, train, validate logs, convert, infer, stop runner). |
| .github/workflows/gpu_test.yml | Update GPU runner ASG name and reduce timeout. |
| .github/scripts/utils.py | Add shared `grep_file` helper for log parsing. |
| .github/scripts/check_state_keys.py | Add validator for missing/unexpected state dict keys in logs. |
| .github/scripts/check_nonzero_grad_norm.py | Add validator ensuring grad norm is present and non-zero. |
| .github/scripts/check_loss_drop.py | Add validator ensuring (smoothed) loss decreases and resume behavior is sane. |
| .github/scripts/check_accumulate_grad_sync.py | Add validator ensuring `accelerator.sync_gradients` matches grad accumulation cadence. |
```python
sync_grads = grep_file(arg.log_path, arg.re_pattern, processor=bool)
assert len(sync_grads) == arg.expected_length, (
    f"Expected {arg.expected_length} sync_gradients, found {len(sync_grads)} in {arg.log_path}."
)
assert all(sg == ((i + 1) % arg.gradient_accumulation_steps == 0) for i, sg in enumerate(sync_grads)), (
```
`processor=bool` will interpret both `"True"` and `"False"` strings as `True` (any non-empty string is truthy), so `sync_grads` will be incorrect and the assertion will fail or pass for the wrong reason. Convert explicitly from the captured string (e.g., map `"True"` -> `True` and `"False"` -> `False`) before running the pattern check.
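A minimal sketch of the suggested fix: a dedicated parser (the name `parse_bool` is hypothetical) that converts the captured strings explicitly and rejects anything else, instead of relying on `bool()` truthiness:

```python
def parse_bool(s: str) -> bool:
    """Parse a captured "True"/"False" log token into a real bool.

    bool("False") would be True because any non-empty string is truthy,
    so the strings must be compared explicitly.
    """
    if s == "True":
        return True
    if s == "False":
        return False
    raise ValueError(f"Unexpected sync_gradients value: {s!r}")
```

This would be passed as `processor=parse_bool` in the `grep_file` call above.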
```yaml
- name: Set up Libero Configs
  shell: bash
  run: |
    source .venv/bin/activate
    mkdir -p /tmp/libero-assets/libero/libero
    export LIBERO_CONFIG_PATH="$(pwd)/.github/assets/libero"
```
In GitHub Actions, `export LIBERO_CONFIG_PATH=...` inside a `run` step only affects that step's shell; it won't persist to later steps like "Run Training". If training/inference needs this env var, write it to `$GITHUB_ENV` (or set it under the job/step `env:`) so it's available in subsequent steps.
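A sketch of the suggested fix, reusing the step from the diff above and assuming later steps read `LIBERO_CONFIG_PATH` from the job environment:

```yaml
- name: Set up Libero Configs
  shell: bash
  run: |
    source .venv/bin/activate
    mkdir -p /tmp/libero-assets/libero/libero
    # Persist the variable for all subsequent steps via $GITHUB_ENV;
    # a plain `export` only lasts until this step's shell exits.
    echo "LIBERO_CONFIG_PATH=$(pwd)/.github/assets/libero" >> "$GITHUB_ENV"
```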
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
What this does
Runs GPU Regression Tests. The tests consist of training, resuming, and running inference on the model.
How it was tested
Ran the workflow on GitHub Actions; see https://github.com/TensorAuto/OpenTau/actions/runs/21304126877/job/61328345455?pr=85
How to checkout & try? (for the reviewer)
Dispatch the workflow manually from the Actions tab on GitHub.
Checklist
Note: Before submitting this PR, please read the contributor guidelines.